manual feature engineering
AdaRec: Adaptive Recommendation with LLMs via Narrative Profiling and Dual-Channel Reasoning
Wang, Meiyun, Polpanumas, Charin
We propose AdaRec, a few-shot in-context learning framework that leverages large language models for adaptive personalized recommendation. AdaRec introduces narrative profiling, transforming user-item interactions into natural language representations to enable unified task handling and enhance human readability. Centered on a bivariate reasoning paradigm, AdaRec employs a dual-channel architecture that integrates horizontal behavioral alignment, which discovers peer-driven patterns, with vertical causal attribution, which highlights the decisive factors behind user preferences. Unlike existing LLM-based approaches, AdaRec eliminates manual feature engineering through semantic representations and supports rapid cross-task adaptation with minimal supervision. Experiments on real e-commerce datasets demonstrate that AdaRec outperforms both machine learning models and LLM-based baselines by up to eight percent in few-shot settings. In zero-shot scenarios, it achieves up to a nineteen percent improvement over expert-crafted profiling, showing effectiveness for long-tail personalization with minimal interaction data. Furthermore, lightweight fine-tuning on synthetic data generated by AdaRec matches the performance of fully fine-tuned models, highlighting its efficiency and generalization across diverse tasks.
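The idea of narrative profiling, turning interaction records into text, can be illustrated with a minimal sketch. This is not AdaRec's implementation: the function name `narrative_profile`, the `(item, action, rating)` record layout, and the 5-point rating scale are all assumptions made for illustration.

```python
def narrative_profile(user_id, interactions):
    """Render (item, action, rating) records as a short natural-language profile.

    The record layout and wording are hypothetical, not taken from the paper.
    """
    if not interactions:
        return f"User {user_id} has no recorded interactions yet."
    clauses = []
    for item, action, rating in interactions:
        clause = f"{action} '{item}'"
        if rating is not None:
            # Assumed 5-point rating scale.
            clause += f" and rated it {rating} out of 5"
        clauses.append(clause)
    return f"User {user_id} " + "; ".join(clauses) + "."


# Hypothetical interaction log for one user.
profile = narrative_profile(
    "u1",
    [("wireless mouse", "purchased", 5), ("desk lamp", "viewed", None)],
)
print(profile)
```

A text profile of this kind could be placed directly into an LLM prompt, which is one way to read the abstract's claim that semantic representations avoid hand-built feature vectors.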
OpCode-Based Malware Classification Using Machine Learning and Deep Learning Techniques
Saini, Varij, Gupta, Rudraksh, Soni, Neel
This technical report presents a comprehensive analysis of malware classification using OpCode sequences. Two distinct approaches are evaluated: traditional machine learning using n-gram analysis with Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree classifiers; and a deep learning approach employing a Convolutional Neural Network (CNN). The traditional machine learning approach establishes a baseline using handcrafted 1-gram and 2-gram features from disassembled malware samples. The deep learning methodology builds upon the work proposed in "Deep Android Malware Detection" by McLaughlin et al. and evaluates the performance of a CNN model trained to automatically extract features from raw OpCode data. Empirical results are compared using standard performance metrics (accuracy, precision, recall, and F1-score). While the SVM classifier outperforms other traditional techniques, the CNN model demonstrates competitive performance with the added benefit of automated feature extraction.
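The handcrafted 1-gram and 2-gram features mentioned above can be sketched as simple counts over an opcode sequence. The sample opcodes, the vocabulary, and the helper names `opcode_ngrams` and `ngram_features` are illustrative assumptions and do not come from the report.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Count contiguous n-grams in a sequence of opcode mnemonics."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def ngram_features(opcodes, vocabulary):
    """Map an opcode sequence to a fixed-length count vector over `vocabulary`."""
    counts = opcode_ngrams(opcodes, 1) + opcode_ngrams(opcodes, 2)
    # Counter returns 0 for n-grams that never occur in this sample.
    return [counts[gram] for gram in vocabulary]


# Hypothetical disassembly output; the opcodes are illustrative only.
sample = ["mov", "push", "call", "mov", "push", "ret"]
vocabulary = [("mov",), ("push",), ("mov", "push"), ("push", "call"), ("xor",)]
features = ngram_features(sample, vocabulary)
print(features)
```

Vectors like `features` are the kind of fixed-length input that an SVM, KNN, or decision tree classifier expects, whereas the CNN approach learns its features from the raw sequence instead.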
Time-aware Metapath Feature Augmentation for Ponzi Detection in Ethereum
Jin, Chengxiang, Zhou, Jiajun, Jin, Jie, Wu, Jiajing, Xuan, Qi
With the development of Web 3.0, which emphasizes decentralization, blockchain technology has ushered in a revolution but has also brought numerous challenges, particularly in the field of cryptocurrency. Recently, a large number of criminal behaviors, such as Ponzi schemes and phishing scams, have emerged on blockchains and severely endanger decentralized finance. Existing graph-based abnormal behavior detection methods on blockchain usually focus on constructing homogeneous transaction graphs without distinguishing the heterogeneity of nodes and edges, resulting in partial loss of transaction pattern information. Although existing heterogeneous modeling methods can depict richer information through metapaths, the extracted metapaths generally neglect temporal dependencies between entities and do not reflect real behavior. In this paper, we introduce Time-aware Metapath Feature Augmentation (TMFAug) as a plug-and-play module to capture the real metapath-based transaction patterns during Ponzi scheme detection on Ethereum. The proposed module can be adaptively combined with existing graph-based Ponzi detection methods. Extensive experimental results show that TMFAug can help existing Ponzi detection methods achieve significant performance improvements on the Ethereum dataset, indicating the effectiveness of heterogeneous temporal information for Ponzi scheme detection.
Manual Feature Engineering
There is also a complementary Domino project available. Many data scientists deliver value to their organizations by mapping, developing, and deploying an appropriate ML solution to address a business problem. Feature engineering is useful for data scientists when assessing tradeoff decisions regarding the impact of their ML models. It is a framework for approaching ML as well as a set of techniques for extracting features from raw data that can be used within the models. As Domino seeks to help data scientists accelerate their work, we reached out to AWP Pearson for permission to excerpt the chapter "Manual Feature Engineering: Manipulating Data for Fun and Profit" from the book Machine Learning with Python for Everyone by Mark E. Fenner. Many thanks to AWP Pearson for providing the permissions to excerpt the work and enabling us to provide a complementary publicly viewable Domino project. We are going to turn our attention away from expanding our catalog of models [as mentioned previously in the book] and instead take a closer look at the data. Feature engineering refers to manipulation--addition, deletion, combination, mutation--of the features. Remember that features are attribute-value pairs, so we could add or remove columns from our data table and modify values within columns. Feature engineering can be used in a broad sense and in a narrow sense. I'm going to use it in a broad, inclusive sense and point out some gotchas along the way. Two drivers of feature engineering are (1) background knowledge from the domain of the task and (2) inspection of the data values. The first case includes a doctor's knowledge of important blood pressure thresholds or an accountant's knowledge of tax bracket levels. Another example is the use of body mass index (BMI) by medical providers and insurance companies.
While it has limitations, BMI is quickly calculated from body weight and height and serves as a surrogate for a characteristic that is very hard to accurately measure: proportion of lean body mass. Inspecting the values of a feature means looking at a histogram of its distribution. For distribution-based feature engineering, we might see multimodal distributions--histograms with multiple humps--and decide to break the humps into bins. A major distinction we can make in feature engineering is when it occurs. Our primary question here is whether the feature engineering is performed inside the cross-validation loop or not.
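The two drivers described above can be sketched in a few lines of Python. The helper names `bmi`, `bin_index`, and `fit_cut_points`, the sample values, and the choice of tertile cut points are assumptions for illustration, not code from the book. The last part shows one way to keep data-driven binning inside the cross-validation loop: the cut points are learned from the training fold only and then applied to held-out values.

```python
import bisect
import statistics


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def bin_index(value, cut_points):
    """Return the index of the bin that `value` falls into, given sorted cut points."""
    return bisect.bisect_right(cut_points, value)


def fit_cut_points(train_values, n_bins=3):
    """Learn bin boundaries (quantiles) from training values only."""
    return statistics.quantiles(train_values, n=n_bins)


# Domain-knowledge feature: derive BMI from two raw columns.
person_bmi = bmi(70.0, 1.75)

# Domain-knowledge bins: commonly used adult BMI cut points.
domain_cut_points = [18.5, 25.0, 30.0]
category = bin_index(person_bmi, domain_cut_points)

# Data-driven bins: fit on the training fold, apply to the held-out fold.
train_fold = [18.0, 21.5, 23.0, 26.5, 29.0, 33.0]  # hypothetical BMI values
held_out_fold = [20.0, 31.0]                        # hypothetical BMI values
learned_cut_points = fit_cut_points(train_fold)
held_out_bins = [bin_index(v, learned_cut_points) for v in held_out_fold]
print(round(person_bmi, 2), category, held_out_bins)
```

Fixed domain thresholds can safely be applied before splitting the data, because they do not depend on the data values. Cut points learned from the data are different: fitting them on the full dataset would let information from the held-out fold leak into training.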
Automating artificial intelligence for medical decision-making
MIT CSAIL researchers are hoping to accelerate the use of artificial intelligence to improve medical decision-making, by automating a key step that's usually done by hand -- and that's becoming more laborious as certain datasets grow ever-larger. The field of predictive analytics holds increasing promise for helping clinicians diagnose and treat patients. Machine-learning models can be trained to find patterns in patient data to aid in sepsis care, design safer chemotherapy regimens, and predict a patient's risk of having breast cancer or dying in the ICU, to name just a few examples. Typically, training datasets consist of many sick and healthy subjects, but with relatively little data for each subject. Experts must then find just those aspects -- or "features" -- in the datasets that will be important for making predictions.
Where Are We with Computer Vision? - insideBIGDATA
In the past several years, we've witnessed how deep learning, specifically convolutional neural networks, has been successfully applied to computer vision, natural language processing, speech recognition, logistics, online advertising, and many other problem domains. There are a few things that are unique about the application of deep learning to computer vision, and understanding these characteristics will help in understanding the state of computer vision. In this article, I'd like to share a nice summary of the state of computer vision from Course 4, "Convolutional Neural Networks," of the new Deep Learning Specialization series on Coursera. Dr. Andrew Ng provides some compelling observations about deep learning and computer vision with the goal of mapping out the future of this increasingly popular technology. Consider that many machine learning problems fall somewhere on a spectrum that runs from working with "small data" to having "big data." For example, there is a decent amount of data available for speech recognition.
Towards Wide Learning: Experiments in Healthcare
Banerjee, Snehasis, Chattopadhyay, Tanushyam, Biswas, Swagata, Banerjee, Rohan, Choudhury, Anirban Dutta, Pal, Arpan, Garain, Utpal
In this paper, a Wide Learning architecture is proposed that attempts to automate the feature engineering portion of the machine learning (ML) pipeline. Feature engineering is widely considered the most time-consuming and expertise-intensive portion of any ML task. The proposed feature recommendation approach is tested on three healthcare datasets: a) the PhysioNet Challenge 2016 dataset of phonocardiogram (PCG) signals, b) the MIMIC II blood pressure classification dataset of photoplethysmogram (PPG) signals, and c) an emotion classification dataset of PPG signals. While the proposed method beats the state-of-the-art techniques on the second and third datasets, it reaches 94.38% of the accuracy level of the winner of the PhysioNet Challenge 2016 on the first. In all cases, the effort needed to reach satisfactory performance (a few days) was drastically less than that of manual feature engineering.